1 Introduction


Various space applications require the design of autonomous systems capable of perceiving, reasoning, 
and acting in a complex non-stationary environment. Moreover, in order to achieve a maximum degree of
autonomy, these systems should be self-calibrating, i.e. highly adaptive to changes in their structure. In 
traditional robotics perception, reasoning, and motor control have been treated as mainly independent 
modules due to the strong reductionism of mainstream artificial intelligence. While various artificial 
intelligence algorithms have been proposed for these modules, they have proved
inadequate for non-stationary environments. For example, vision has been traditionally separated into
two parts, preattentive and postattentive vision. The former has been formalized as an “inverse optics” 
problem while the latter as a symbol manipulation problem. Inverse optics attempts to invert geometric 
and radiometric equations relating the properties of the three-dimensional environment to the shape 
and luminance of the image. The inverse problem is ill-posed, and traditional computer vision uses additional
assumptions to reduce the solution space. However, these assumptions imply strong constraints
on the environment. When these constraints are not satisfied, the performance of the algorithm deteriorates
severely. Therefore, these approaches are inadequate for non-stationary environments. A powerful
technique for abstract reasoning is the expert system paradigm. Again, the success of expert systems 
is limited to stationary environments. In motor control, traditional approaches use a reference model 
for the plant to be controlled (e.g. arm). Even in the presence of such a model, the inverse problems 
are ill posed and require additional assumptions. Furthermore, the parameters of the actuator change 
through time (for example due to fatigue) and traditional robotic systems require human intervention 
for calibration. In order to avoid such interventions, a non-stationary approach becomes desirable. 

A recent approach in robotics attempted to avoid these difficulties by simplifying the desired behavior 
(instead of the environment as in the approaches mentioned above) (Brooks, 1986, 1989). This enabled 
the design and hardware implementation of autonomous systems with simple behaviors such as escape 
and wander. The simplification of the behavior enables one to cross the perception-reasoning-action
stages without designing complex structures for each. In fact, the system proposed by these authors is a
“hard-wired production-rule system” with simple rules (here, simple rules are sufficient because simple
behaviors are sought). However, this approach potentially faces the same problems that traditional
artificial intelligence faced when it tried to generalize the “block world” techniques to complicated
environments, for it does not address the fundamental shortcoming of these approaches: the lack of
self-organization tailored for non-stationary environments. In that respect, various neural network 
approaches that have been proposed for robotics are also inadequate because, while they are capable of 
self-organization, they cannot cope with non-stationary environments. 
In this report, we describe a neural network robotic system designed to self-adapt in non-stationary
environments. As we will discuss, active exploration plays an essential role in our approach. This 
stems from our conceptualization of intelligence as a process rather than a fact. Consequently, the 
understanding and the emulation of intelligent behavior requires the characterization of its dynamics 
rather than just its equilibria. This led us to study minimal systems with active exploratory capabilities. 
The system has been implemented and tested in software. We also describe the details of various 
implementations. 

2 General principles 

Unlike many primitive animals which are almost completely genetically wired, human infants undergo 
an extensive developmental period, during which they learn to control and coordinate various parts of 
their body. This self-organization process is highly adaptive for it is able to control a non-stationary 
system (e.g. the growing child’s arms become longer etc.). Demonstrations of these effects in adults 
are quite old. Helmholtz showed that adults can adapt to inverting prisms placed in front of their 
eyes. While many interpretations of this adaptation process have been based on an adult error detection
and correction behavior, Held and co-workers defended the view that a single mechanism is
responsible for both infant sensory-motor coordination and adult sensory-motor adaptation (Held and
Bossom, 1961; Held and Hein, 1963; Held, 1965). Furthermore, their experiments suggested that an
active control of muscles is necessary for sensory-motor adaptation. The role of active processes as a
basis of self-organization goes beyond sensory-motor coordination. Piaget’s studies suggested that active
exploration plays an essential role in the development of intelligence in addition to perception and
motor control (Piaget, 1963, 1967, 1969, 1970). Within this framework these three aspects of behavior
develop together by continuously influencing and being influenced by each other.

Given this primary role for active exploration in intelligent behavior, the fundamental question
that we posed is “what are the simple circuits and organizational principles that are necessary for the
initiation of an exploratory behavior?” To answer this question, we analyzed the first step of exploration,
reaching out for targets. The system that we designed has sensorial inputs and motor outputs and a
“cognitive unit” that coordinates these two. An important step in intelligent behavior is stimulus
generalization. To achieve this behavior we introduced categorization circuits. While there are various 
categorization circuits proposed in the neural network literature, in order to satisfy our requirement 
of non-stationary environments, we chose adaptive resonance theory architectures that are capable of 
forming stable categories in non-stationary environments (Carpenter and Grossberg, 1987, 1988). Simple
adaptive resonance circuits require that the inputs are pre-processed by a figure-ground segregation
network. To achieve a very simple figure-ground segregation, we introduced a non-homogeneous retina
consisting of a high resolution fovea and a periphery. The figure-ground segregation is achieved by
directing the fovea to parts of the image such that the foveal signal is treated as figure and the peripheral
signal as ground. This in turn required that we introduce a circuit for controlling eye movements. The
basic circuit for eye movements should be flexible enough to operate under both sensorial and cognitive control
to allow an exploratory behavior that accommodates both sensorial and cognitive cues. For that task,
we modified some neural circuits proposed for optokinetic behavior in lower animals as the seed of the
eye movement system (Ogmen and Gange, 1990a, 1990b). An important characteristic of this circuit is its
sensitivity to novelty which is also crucial for exploration. In addition to sensitivity to spatial novelty, the 
system should be able to recognize the novelty of abstract stimuli representations (categories). Another 
important property for exploration is the success of behaviors. In simple developing systems the success
of behaviors is determined by reinforcement signals and thus the system should be capable of operant
conditioning. However, like any signal in an uncontrolled environment, reinforcement signals can also be
noisy. In order to filter the transients in the reinforcement signals another property is required: habits.
In sum, a simple neural network should have stimulus generalization, novelty, reinforcement learning
and habit formation properties. Whether these properties are sufficient is the question we attempt to
answer by building and analyzing neural network architectures having these properties. The complete
system contains various interacting sub-architectures. We will first describe these architectures
and the modifications to each architecture for the needs of the present application. We will then present
the combined model that integrates these architectures into a global system. We finally conclude by
discussing future directions in this research.

3 Reinforcement learning and habits for reinforcement filtering 

3.1 Neurophysiological basis 

Assimilation and avoidance of a behavior are highly correlated with reward and punishment signals. Animals
and humans generally tend to favor rewarding events over punishing ones. Moreover, in certain cases
the absence of reward could act as punishment and vice versa i.e., not receiving punishment can act as 
a reward. Studies by Milner and Pribram have suggested that the frontal lobes in primates and humans 
(see Figure 1) are primarily responsible for correlating the reward and punishment to actions (Milner, 
1963, 1964; Pribram, 1961). Milner found that patients with frontal lobe lesion demonstrated difficulty 
in shifting their response to a changing environment. For example, if a certain action (like eating sweets),
which was formerly rewarding, later turned out to be punishing (caused a stomach ache due to overeating),
then these patients found it hard to change their original behavior.

Figure 1: The human brain can be divided externally into four lobes: the occipital lobe (responsible for
visual activity), the parietal lobe (responsible for abstract thinking, e.g. logic, math, etc.), the temporal
lobe (responsible for hearing and memory) and the frontal lobe (responsible for goal-directed behavior,
articulation of speech, and control of discrete movement of the body).

Locus of lesion             No. of   Categories   Total    Perseverative   Nonperseverative
                            cases    achieved     errors   errors          errors
Dorsolateral frontal        25       1.372        75.44    56.932          18.5
Control:
  Orbitofrontal + temporal  7        4.9          27.6     12.0            15.6
  Inferior frontal          1        6            13       4               9
  Bilateral hippocampal     1        4            48       11              37
  Posterior cortex          60       4.52         32.96    16.64           16.32

Table 1: Milner’s summarized results

Her test group consisted of
71 patients who were tested both before and after frontal lobectomy (removal of the frontal lobe for 
relief from some kinds of seizures). Her test group also included 23 patients who had previously received
frontal lobectomies (in some cases years earlier). The test which she used on these subjects dealt
with matching of similar cards and is called the “Wisconsin Card Test”. The subjects were shown four
“stimulus cards” (see Figure 2) which consisted of cards differing in color, form and number: one red
triangle, two green stars, three yellow crosses and four blue circles. The subjects were then given a pack
of 128 “response cards” which were combinations of the above criteria, i.e. the four colors, forms and
numbers. Figure 2 shows one such response card (two red crosses) which corresponds to the stimulus 
card 1 in color, to the stimulus card 2 in number, and to the stimulus card 3 in form. The subjects 
were then told to place the response card in front of one of the four stimulus cards, wherever they 
thought it should go. They were then informed whether their choice was “right” or “wrong”. The 
subjects used this information and tried to get as many cards right as they could. No other cues were 
given. The subjects were required to sort first by a certain color, all other responses being called wrong. 
Then, once they had achieved 10 consecutive correct responses to color, the sorting criterion was changed
to form, without previous warning, and color responses were deemed wrong. After the subjects gave
10 consecutive correct responses to form, the criterion was shifted to number and then back to color
again. This procedure continued until the subjects had successfully completed six sorting categories
(color, form, number, color, form, number) or until all 128 cards had been placed. 
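The sorting protocol described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `run_wcst`, `choose`, `CRITERIA` and `FIELD` are hypothetical, and the response cards are drawn at random because the exact composition of the 128-card pack is not specified here.

```python
import random

CRITERIA = ("color", "form", "number")
# The four stimulus cards as (number, color, form).
STIMULUS_CARDS = [
    (1, "red", "triangle"),
    (2, "green", "star"),
    (3, "yellow", "cross"),
    (4, "blue", "circle"),
]
FIELD = {"number": 0, "color": 1, "form": 2}


def run_wcst(choose, n_cards=128, run_length=10, n_categories=6, seed=0):
    """Run the sorting protocol against a subject policy.

    `choose(card, feedback)` is a hypothetical callback: it receives the
    response card and the previous "right"/"wrong" feedback (None at first)
    and returns the index (0..3) of the stimulus card it picks.
    Returns (categories_completed, total_errors).
    """
    rng = random.Random(seed)
    numbers = [c[0] for c in STIMULUS_CARDS]
    colors = [c[1] for c in STIMULUS_CARDS]
    forms = [c[2] for c in STIMULUS_CARDS]
    completed, streak, errors, feedback = 0, 0, 0, None
    for _ in range(n_cards):
        if completed == n_categories:
            break
        criterion = CRITERIA[completed % 3]  # color, form, number, color, ...
        card = (rng.choice(numbers), rng.choice(colors), rng.choice(forms))
        pick = choose(card, feedback)
        k = FIELD[criterion]
        correct = STIMULUS_CARDS[pick][k] == card[k]
        feedback = "right" if correct else "wrong"
        if correct:
            streak += 1
            if streak == run_length:  # shift the criterion without warning
                completed += 1
                streak = 0
        else:
            errors += 1
            streak = 0
    return completed, errors
```

A policy that always sorts by color completes the first category and is then marked wrong on most cards once the criterion shifts to form, which is the kind of behavior the test is designed to expose.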

Milner found that the subjects who did not have dorso-lateral frontal lobectomy could change their 
responses as the sorting criterion changed. However subjects who had dorso-lateral frontal lobectomy 
could sort the cards correctly only for the first criterion and continued to sort according to this criterion 
even after the criterion was shifted to form by the experimenter. Milner categorized the “wrong” 
responses given by subjects into two categories. The first was the “perseverative errors”, which consisted
of responses which would have been correct on the immediately preceding stage of the test or, in the
first stage, a continued response in terms of the subject’s initial preference (which is now incorrect).
All other errors were lumped together and called “non-perseverative errors”. Table 1 summarizes
the findings of Milner. As can be seen from Table 1, subjects with dorso-lateral frontal lobectomy were
unable to shift from one sorting principle to another. They committed an average of 57 perseverative
errors. Milner reported that in some cases these subjects were unable to shift their criterion throughout
the entire experiment. The control subjects on the other hand could change their criterion and had about
15 perseverative errors. Non-perseverative errors in general were not as pronounced as perseverative
errors except in the case of subjects who had bilateral hippocampal lesions. This anomaly is attributed 
to a profound memory loss. 

3.2 Significance 

Thus Milner’s experiment suggests that the frontal lobes play a major role in enabling one to shift one’s
response to meet a changing environment. This is essential for successful performance in a non-stationary
environment. Yet, as mentioned earlier, reinforcement signals can be noisy and normal humans filter
those signals by mixing reinforcement and habit tendencies. Milner’s results suggest that if the gain of the
reinforcement signal is diminished (due to frontal lobe damage) then habit filters become dominant and
cause extreme perseveration of behavior. While Milner’s study is directed to anomalous brain function,
it reveals an important aspect of reinforcement filtering (habits) which is otherwise hidden from the
observer.

3.3 Neural models 

Leven and Levine proposed a neural network to model some roles of the frontal lobes in the decision making
process (Leven and Levine, 1987). Their model comprises two parts, as shown in Figure 3. The
left hand side comprises the categorization network, which is responsible for choosing the “correct”
category for the given input “response card”, and the right hand side comprises the
habit and bias network.

The categorization network consists of an Adaptive Resonance Theory (ART) network. ART
has two layers: the input layer consisting of feature neurons and the categorization (output) layer
consisting of category neurons. Each of the feature neurons codes a particular criterion: color, shape
or number of the input “response card”. There are a total of twelve neurons in this layer and each
criterion is represented by four neurons (one for each of the four attributes in a given criterion). Hence
a particular response card, say two red crosses, is represented by the input vector (0100 1000 0010) where
the first four digits from the left represent the number criterion: one, two, three and four, the next four
digits represent the color criterion: red, green, yellow and blue and the final four digits represent the
form criterion: triangle, stars, crosses and circles respectively. Any response card would activate one
of the four neurons of a given criterion. The above input would activate the $x_2$, $x_5$ and $x_{11}$
neurons of the input layer.
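The encoding just described can be written out as a short sketch. The function and constant names (`encode_card`, `active_neurons`, `NUMBERS`, `COLORS`, `FORMS`) are illustrative choices, not part of the original model description.

```python
NUMBERS = ("one", "two", "three", "four")
COLORS = ("red", "green", "yellow", "blue")
FORMS = ("triangle", "star", "cross", "circle")


def encode_card(number, color, form):
    """Return the 12-element binary feature vector (number, color, form)."""
    vec = [0] * 12
    vec[NUMBERS.index(number)] = 1      # digits 1-4: number criterion
    vec[4 + COLORS.index(color)] = 1    # digits 5-8: color criterion
    vec[8 + FORMS.index(form)] = 1      # digits 9-12: form criterion
    return vec


def active_neurons(vec):
    """1-based indices of the feature neurons activated by the card."""
    return [i + 1 for i, v in enumerate(vec) if v]
```

For the card "two red crosses", `encode_card("two", "red", "cross")` gives the vector (0100 1000 0010) and `active_neurons` returns the indices 2, 5 and 11 mentioned in the text.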

The categorization layer consists of four neurons which represent the four “stimulus cards”:
one red triangle, two green stars, three yellow crosses and four blue circles. A given input “response
card” would match one of the four “stimulus card” category neurons depending on reward input. Initially,
when a “response card” is shown to the categorization network the feature neurons which code
the attributes of this response card become active. The activity of a feature neuron is given by the 
following differential equation. 


$$\frac{dx_i}{dt} = -A x_i + (B - C x_i)\Big(I_i + \sum_{j=1}^{4} f(y_j)\, z_{j,i}\Big) - D x_i \sum_{j=1}^{4} f(y_j), \qquad i = 1, 2, \ldots, 12, \tag{1}$$

where $I_i$ is the $i$-th component of the input vector given to the feature layer (e.g. two red crosses would be represented as
(010010000010) = $(I_1, I_2, \ldots, I_{12})$ = I), $x_i$ represents the activity of the $i$-th feature neuron, $y_j$ is the
activity of the $j$-th category neuron, and $z_{j,i}$ represents the top-down weight connecting the $j$-th category
neuron with the $i$-th feature neuron. $A$, $B$, $C$ and $D$ are positive constants and $f$ is a sigmoid function
which is defined as follows.

$$f(x) = \arctan(x - 1) + \frac{\pi}{2} \tag{2}$$

Equation (1) is a shunting equation (Grossberg, 1988) and the activity of feature neurons is bounded
(in this case between zero and $B/C$). The first term $-A x_i$ in the above differential equation is responsible
for the passive decay of activity at rate $A$. The second term $(B - C x_i)(I_i + \sum_{j} f(y_j) z_{j,i})$ is the
excitatory term and it consists of the input $I_i$ and the excitation from the category nodes¹ weighted
by the top-down weight $z_{j,i}$. The third and final term, $-D x_i \sum_{j} f(y_j)$, is the inhibitory part, which
consists of inhibition from the categorization neurons. This inhibition allows the network to distinguish
between bottom-up and top-down signals.
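The behavior of the feature layer can be checked numerically with a minimal Euler integration of equation (1). This is a sketch under assumptions: the constants A = B = C = D = 1, the step size and the function names are hypothetical, the category activities are held fixed during the integration, and the sigmoid of equation (2) is read as arctan(x − 1) + π/2.

```python
import math


def f(x):
    """Sigmoid signal function, read here as arctan(x - 1) + pi/2."""
    return math.atan(x - 1.0) + math.pi / 2.0


def simulate_features(I, y, z, A=1.0, B=1.0, C=1.0, D=1.0, dt=0.01, steps=5000):
    """Euler-integrate equation (1) with the category activities y held fixed.

    I: 12 inputs, y: 4 category activities, z[j][i]: top-down weights.
    """
    fy = [f(yj) for yj in y]
    inhibition = sum(fy)
    x = [0.0] * len(I)
    for _ in range(steps):
        for i in range(len(I)):
            excitation = I[i] + sum(fy[j] * z[j][i] for j in range(len(y)))
            dx = -A * x[i] + (B - C * x[i]) * excitation - D * x[i] * inhibition
            x[i] += dt * dx
    return x


def steady_state(I_i, top_down, inhibition, A=1.0, B=1.0, C=1.0, D=1.0):
    """Closed-form equilibrium of equation (1) for one feature neuron."""
    excitation = I_i + top_down
    return B * excitation / (A + C * excitation + D * inhibition)
```

Setting the right-hand side of equation (1) to zero gives the equilibrium used in `steady_state`; since excitation and inhibition are non-negative, the equilibrium always lies between zero and B/C, which is the bound stated in the text.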

¹ “Node” and “neuron” are used synonymously.

Figure 3: The network developed by Leven and Levine, based on ART, to simulate Milner’s card
sorting data. The reinforcement signal $R$ is applied to the bias nodes $\Omega$. The bias nodes in turn
gate the category neurons. The bias neurons also get input from the habit neurons, which encode past
categorizations. The match signals (which state the particular criterion of the input that was responsible
for categorization) are encoded by the $\Phi$ nodes. The feature neurons encode the features of the input
card and categorization is achieved depending on previous reinforcements.

The activity of the category neurons represents the possibility that the input “response card” belongs
to that “stimulus card” category. This activity is given by equation (3).

$$\frac{dy_j}{dt} = -A y_j + (B - C y_j)\Big(f(y_j) + \sum_{i=1}^{12} g(\Omega_{\lceil i/4 \rceil})\, x_i\, z_{i,j}\Big) - D y_j\Big(\sum_{r \neq j} f(y_r) + \mathcal{I}\Big), \qquad j = 1, 2, 3, 4, \tag{3}$$

One of the excitatory terms in the above equation, $\sum_{i=1}^{12} g(\Omega_{\lceil i/4 \rceil})\, x_i\, z_{i,j}$, consists of feature node activities
$x_i$ gated by the bias nodes $\Omega_k$ (with $k = \lceil i/4 \rceil$ the criterion to which feature $i$ belongs) and the
bottom-up weights $z_{i,j}$. The other excitatory term is the positive self feedback of
the node itself. The inhibitory term consists of mutual lateral inhibition given by $\sum_{r \neq j} f(y_r)$ and the
reset signal $\mathcal{I}$. The function $g$ in the above differential equation is as follows.

$$g(x) = \begin{cases} 0, & x < 0.5 \\ x - 0.5, & 0.5 \le x \le 3 \\ 2.5, & x > 3 \end{cases} \tag{4}$$
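As a small illustration, the piecewise function of equation (4) and the gated bottom-up drive to a category neuron can be sketched as follows. The helper `gated_input` and its argument names are hypothetical, and the sketch assumes that each feature is gated by the bias node of its own criterion (number, color or form), which is how the gating is read here.

```python
def g(x):
    """Linear-above-threshold function of equation (4), saturating at 2.5."""
    if x < 0.5:
        return 0.0
    if x <= 3.0:
        return x - 0.5
    return 2.5


def gated_input(x, omega, w_j):
    """Bottom-up drive to one category neuron: features gated by bias nodes.

    x: 12 feature activities, omega: 3 bias activities (number, color, form),
    w_j: 12 bottom-up weights into category j. Feature i (0-based) is assumed
    to be gated by the bias node of its own criterion, i // 4.
    """
    return sum(g(omega[i // 4]) * x[i] * w_j[i] for i in range(len(x)))
```

With a strongly active color bias node and weak number and form bias nodes, only the color feature of the card contributes to the drive, so the category sharing the card's color receives the largest input.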

When a “response card” is input to the network, the feature neurons that encode the characteristics 
of that input become active. These active feature neurons excite the respective category neurons. 
Consider the example of two red crosses input “response card” again. This card could excite “category 
1” (because of color), “category 2” (because of number) or “category 3” (because of shape). Now, 
suppose that categorization with respect to color has been rewarding in the previous attempts. Then, 
the habit and bias nodes encoding the color feature would be more active than those of the other criteria. This
would lead to a greater activity of the first category neuron. The faster than linear function for the
self-feedback and lateral inhibition terms causes the first category neuron to become more active and
also, at the same time, suppress the activity of the other category neurons². Thus, the input “response
card” would be matched to the first “stimulus card” category since the previously rewarding criterion
had been color. 

The habit and bias networks each comprise three neurons, one for each criterion: number, color
and shape. The habit nodes detect how often a dimension (number, color or shape) is used to determine 
categorization of the “response card” regardless of whether that categorization is rewarded or punished. 
The activity of the habit nodes is given as follows 

where $\theta_2$, $H_1$ and $J$ are positive constants. The functions $[x]^+$ and $[x]^-$ used in the above differential
equation are defined as follows.

$$[x]^+ = \begin{cases} x & \text{if } x > 0 \\ 0 & \text{else} \end{cases} \tag{6}$$

$$[x]^- = \begin{cases} -x & \text{if } x < 0 \\ 0 & \text{else} \end{cases} \tag{7}$$

² For a detailed study of different types of functions, i.e. slower than linear, linear and faster than linear, and the roles
they play in lateral inhibition and self-excitation, the reader is referred to (Grossberg, 1973).

$\Phi_k$ is called the “match signal”, and is given by

$$\Phi_k = \sum_{i=4k-3}^{4k} z_{j,i}\, I_i, \qquad k = 1, 2, 3 \tag{8}$$


where $j$ is the index of the chosen category, $z_{j,i}$ are the top-down weights and $I_i$ is the input vector
of the “response card”. For example, if the input card is two red crosses (i.e., 0100 1000 0010) and
sorting based on color had been rewarding, then the input card would be sorted by color. For the habit
node representing color, $k = 2$. Thus, the range of the summation for calculating the match signal
goes from $I_5$ to $I_8$. Now $I_i$ is positive for $i = 5$ (the color red) and $I_i = 0$ for $i = 6, 7, 8$. Hence, the top-down
weights of only the “stimulus card” category chosen (i.e. $j = 1$, which represents one red triangle)
would be gated with the input signals $I_5$ to $I_8$ to form the match signal for the color match node.
This ensures that only the criterion that was used to sort the input card is credited with the successful
match. Thus only the “match signal” containing the information regarding color becomes much larger
than the other match signals. Hence the activity of the habit node for which the match signal is large
would increase and the activity of the rest of the habit nodes would decrease. The activity of a habit
node for a particular dimension corresponds to the past frequency with which that dimension was used
to categorize the input “response card”.
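The worked example above can be reproduced with a short sketch of the match signal sum. The function name `match_signals` is illustrative, and the top-down weights of the chosen category are assumed, for the purpose of the example only, to be a binary template of the stimulus card.

```python
def match_signals(I, z_j):
    """Match signals Phi_k = sum over i = 4k-3 .. 4k of z_{j,i} * I_i, k = 1, 2, 3.

    I: 12-element input vector; z_j: top-down weights of the chosen category j.
    Returns [Phi_number, Phi_color, Phi_form].
    """
    # range(4k - 4, 4k) covers the 1-based indices 4k-3 .. 4k in 0-based form.
    return [sum(z_j[i] * I[i] for i in range(4 * k - 4, 4 * k)) for k in (1, 2, 3)]
```

For the input two red crosses (0100 1000 0010) and the category "one red triangle" with an assumed template (1000 1000 1000), the result is [0, 1, 0]: only the color match signal is non-zero, so only the color criterion is credited with the match.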

The Bias nodes, on the other hand, encode the recent past success of using the corresponding 
dimension (number, color, shape) to categorize the input “response card”. As such, the Bias nodes 
receive direct input from the reinforcement signal. A positive reinforcement signal ($R^+$) implies that
the criterion used to sort the input “response card” is correct and a negative reinforcement signal ($R^-$)
implies the opposite. The activities of the bias nodes are given below.


$$\frac{d\Omega_k}{dt} = -E\,\Omega_k + \Big\{(F - \Omega_k)\big([h_k - \theta_1]^+ + \alpha R^+ + g(\Omega_k)\big) - \Omega_k\big(\alpha R^- + G \sum_{r \neq k} g(\Omega_r)\big)\Big\} f(\Phi_k), \qquad k = 1, 2, 3. \tag{9}$$

In the above equation both the excitatory and the inhibitory terms are gated by the match signals so
that the reinforcement signals (either positive or negative) are conveyed to the appropriate bias neuron.
Like the habit nodes there are three bias neurons. The excitatory term in the above equation, that is
$(F - \Omega_k)([h_k - \theta_1]^+ + \alpha R^+ + g(\Omega_k))$, consists of an external reward signal $R^+$ multiplied by a gain
factor $\alpha$, the habit node activity $h_k$ and a positive self feedback term $g(\Omega_k)$. The inhibitory term consists
of $\Omega_k(\alpha R^- + G \sum_{r \neq k} g(\Omega_r))$ where $R^-$ is the external punishment signal multiplied by the gain $\alpha$ and
$\sum_{r \neq k} g(\Omega_r)$ represents the activities of the rest of the bias nodes. Coming back to our example (i.e. the two red crosses), suppose
that the sorting of the input “response card” according to color was correct and the reward signal ($R^+$)
was briefly turned high.³ This would increase the activity of the color bias node which would
encourage future sortings of input “response card” by color. The increase in the activity of “color” 
bias node would cause the decrease in the activity of the other two bias nodes due to inhibition (thus 
indirectly leading to an increase in its own activity). On the other hand, supposing the sorting criterion 
used to categorize the input “response card” was inappropriate, then, a punishment signal would be 
given which would cause a decrease in the activity of the respective (in this case the “color”) bias node. 
This would lead to an increase in the activities of the other two bias nodes. Eventually categorization 
of future input “response cards” would be by a criterion other than “color”. 

The gain term $\alpha$ that multiplies the reward and punishment signals represents the influence external
signals have on the categorization of input “response cards”. Normal subjects capable of changing their
criterion of sorting according to external reinforcement signals have a high value for $\alpha$. Those who
are incapable of this task due to lesions in their frontal lobes have a low value for $\alpha$. Thus the above
network illustrates how the criterion for categorization depends on the external reinforcement signal.

3.4 Modifications 

The model proposed by Leven and Levine (1987), as discussed above, comprises
two parts: the Categorization network and the Habit and Bias network. The simulations that they
conducted using the above differential equations were also conducted in a two-stage manner. In the first
stage the input vector was presented to the Categorization network which sorted the input “response
card” to a specific category depending on which of the criteria had been rewarding. Once the input
had been sorted to a particular category the match signal was calculated algorithmically for the given
input. Then, in the second stage, the Habit and Bias node activities were simulated along with the
reinforcement signal.

In order to obtain a continuous-time, non-algorithmic model, the algorithmic match signal computation was replaced
by dynamic match signal nodes. Moreover, in order to make the system self-aware as to when a
categorization choice was made, a decision layer was incorporated. A cognitive node was also designed
to resolve the ambiguity in case a decision regarding the category was not made.

³ The reward and punishment signals $R^+$ and $R^-$ are large positive and negative pulses which cause a fast “rise” of
the bias node response. The decay rates of the bias and habit nodes are much smaller than those
of the feature and category neurons. Hence the activities of the bias and habit nodes tend to change much more slowly than the
activities of the feature and category nodes during categorization of input “response cards”.
Figure 4: The modified version of the Leven and Levine model consists of an additional decision layer
and an ambiguity neuron. The decision layer ensures that only one of the categories is selected. The
ambiguity neuron resolves any indecisions that occur in the network as to how an “input card” should
be sorted. This indecision is resolved randomly. The match signals are generated when a particular card 
has been categorized according to a particular criterion. The bias nodes modulate the categorization 
according to reinforcement signals and habits. 
The modified neural network is shown in Figure 4. The match signals formerly computed algorithmically
in the original Leven and Levine (1987) model are computed in continuous time
in the modified model. The match signal node activity is given below.


$$\frac{d\Phi_k}{dt} = -A\,\Phi_k + (B - C\,\Phi_k)\Big\{\sum_{j=4k-3}^{4k} \sum_{i=1}^{4} I_j\, g_1(p_i - \theta_1)\, z_{i,j}\Big\} - D\,\Phi_k\Big(\mathcal{I} + \sum_{r \neq k} g_1(\Phi_r)\Big), \qquad k = 1, 2, 3, \tag{10}$$

$A$, $B$, $C$, $\theta_1$ and $D$ are positive constants. The match signal consists of an excitation term
$(B - C\,\Phi_k)\{\sum_{j=4k-3}^{4k} \sum_{i=1}^{4} I_j\, g_1(p_i - \theta_1)\, z_{i,j}\}$ where $I_j$ is the input vector, $p_i$ is the activity of the neurons in the
decision layer, $\theta_1$ is a constant and $z_{i,j}$ are the top-down weights from the category nodes to the feature
nodes. The inhibition term $-D\,\Phi_k(\mathcal{I} + \sum_{r \neq k} g_1(\Phi_r))$ consists of a reset signal $\mathcal{I}$ (cf. eqn 3) and a summation of
the other match signals as input to a “linear above threshold function” $g_1$. The function $f$ is the same
“faster than linear function” used in the original model.

Consider the example discussed previously (in light of the new modifications) in which the input 
“response card” was two red crosses. That input is now given to the new network. Assume that sorting
by the “color” criterion had been rewarding. Then the input card would be categorized as being in 
the first category (which corresponds to one red triangle). However, during sorting, various category 
nodes might have varying activities. To generate the correct match signal the categorization must be 
completed. The decision layer which has been included into the modified network ensures this. 

The decision layer, comprised of four nodes (one for each “stimulus card” category), has feed-forward 
connections from the category nodes. The following is the differential equation for one such decision 
node. 


$$\frac{dp_i}{dt} = -A_1 p_i + (B_1 - p_i)\, W y_i - p_i\, W \sum_{j \neq i} y_j, \qquad i = 1, 2, 3, 4, \tag{11}$$

where $W$ represents a constant gain by which the inputs to the decision layer are multiplied. This
equation represents a shunting, on-center off-surround type feedforward network. $p_i$ encodes the ratio
of the $y_i$ node activity to the total activity of the categorization layer. By setting a threshold level
for the activity of the decision layer neurons it can be determined if categorization is achieved. When
categorization has been achieved, only one of the decision making neurons is active and that neuron
represents the category to which the input “response card” has been sorted.
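The ratio-coding property of the decision layer can be illustrated with a small numerical sketch. It assumes that the surround sum in equation (11) runs over the other category nodes (j ≠ i); the constants, the threshold value and the function names are hypothetical choices for illustration.

```python
def decision_steady_state(y, A1=0.1, B1=1.0, W=10.0):
    """Equilibrium of equation (11): p_i = B1*W*y_i / (A1 + W*sum(y))."""
    total = sum(y)
    return [B1 * W * yi / (A1 + W * total) for yi in y]


def simulate_decision(y, A1=0.1, B1=1.0, W=10.0, dt=0.001, steps=20000):
    """Euler-integrate equation (11) with the category activities y held fixed."""
    p = [0.0] * len(y)
    total = sum(y)
    for _ in range(steps):
        p = [
            pi + dt * (-A1 * pi + (B1 - pi) * W * yi - pi * W * (total - yi))
            for pi, yi in zip(p, y)
        ]
    return p


def decided(p, threshold=0.8):
    """Index of the winning category, or None if no single node exceeds the threshold."""
    winners = [i for i, pi in enumerate(p) if pi > threshold]
    return winners[0] if len(winners) == 1 else None
```

When the gain W is large relative to the decay A1, the equilibrium reduces to approximately B1 times the share of y_i in the total category activity, so a thresholded p_i signals that one category clearly dominates, while evenly split category activities leave every p_i below threshold.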

Consider the case when no previous categorization has been done. When an input card (say two red 
crosses) is given to the categorization network, no biases for sorting the input card according to any 
specific criterion (i.e. number, color or shape) exist since none of the criteria is preferred over
the others. However, in the case of humans, since a decision has to be made, one arbitrarily selects one of
the three category nodes for the card (forced choice paradigm). This higher cognitive decision making
process is modeled by an ambiguity neuron $n$ whose differential equation is given by

$$\frac{dn}{dt} = r\Big\{-A n + (B - n) \sum_{i=1}^{12} I_i - n\Big(T \sum_{j=1}^{4} p_j + \mathcal{I}\Big)\Big\} \tag{12}$$

The excitation term $(B - n) \sum_{i=1}^{12} I_i$ consists of the summation of all input features $I_i$ so as to make
sure that this neuron is active only when an input card is presented to the categorization network. The
inhibition term consists of a high gain parameter $T$ which multiplies the summation of the activities of
the decision layer nodes. This is to ensure that if categorization of the input card has been achieved
then the ambiguity neuron activity is inhibited. The inhibition term also contains a reset
signal $\mathcal{I}$ which resets the ambiguity neuron before a new card is input to the categorization network.
When the categorization network is incapable of sorting the input card to a particular category, the
ambiguity neuron introduces a random bias to the inputs of the category neurons which leads to the 
categorization of the input card to one of the category nodes. To incorporate this random bias, the 
activity of the category neurons is modified to the following differential equation. 

$$\frac{dy_j}{dt} = -A y_j + (B - C y_j)\Big(f(y_j)\big(1 + [n - \theta_2]^+ v_{jn}\big) + \sum_{i=1}\ldots\Big) - D y_j\Big(\sum_{r \neq j} f(y_r) + \ldots\Big), \qquad j = 1, 2, 3, 4 \qquad (13)$$

Compared with the differential equation for the category neuron in the original model discussed in the
previous section, the above differential equation includes only one additional excitatory term,
$(1 + [n - \theta_2]^+ v_{jn})$. In the above equation $v_{jn}$ is the random connection weight between
the ambiguity neuron and the respective category node, and $\theta_2$ is a threshold constant.

The differential equations for the feature neurons, the habit neurons and the bias neurons were similar
to those used in the original model described in the previous section. This concludes the modifications
made to the Leven and Levine model.

3.5 Simulation Results


3.5.1 ART network 

An ART-1 network designed by Carpenter and Grossberg was simulated⁴ to categorize letters of the
alphabet (Carpenter and Grossberg, 1987). The ART-1 network is shown in Figure 5 and comprises two
layers, $F_1$ and $F_2$, which encode patterns of activation in short-term memory (STM). The bottom-up
and top-down pathways between $F_1$ and $F_2$ contain adaptive long-term memory (LTM) traces. Various
handwritten letters were captured using a vision system and converted to binary images. These letters
were then used as inputs to the ART-1 network, and the resulting categorizations were observed for
different vigilance parameters. Figure 6 shows the categorization of letters for vigilance coefficients
of 0.95 and 0.93. The left-hand side shows the set of images used to train the network. The right-hand
side shows the top-down weights of the category into which each input was categorized for vigilance
parameters 0.95 and 0.93.
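
The role of the vigilance parameter can be illustrated with a heavily simplified, fast-learning variant
of ART-1 operating on binary vectors. This is a sketch under assumptions, not the simulated network: the
function names, the choice parameter `beta`, the toy patterns and the vigilance values 0.7 and 0.9 are
hypothetical, and the STM dynamics of $F_1$ and $F_2$ are replaced by an algebraic search.

```python
def overlap(p, w):
    """Number of active bits shared by binary pattern p and template w."""
    return sum(a & b for a, b in zip(p, w))


def art1_categorize(patterns, vigilance, beta=0.5):
    """Assign each non-empty binary pattern to a category (fast learning).

    Returns (labels, templates); templates play the role of top-down weights.
    """
    templates, labels = [], []
    for p in patterns:
        # Search categories in order of the bottom-up choice function.
        order = sorted(range(len(templates)),
                       key=lambda j: -overlap(p, templates[j]) / (beta + sum(templates[j])))
        chosen = None
        for j in order:
            # Vigilance test: fraction of the input matched by the template.
            if overlap(p, templates[j]) / sum(p) >= vigilance:
                chosen = j
                break
        if chosen is None:
            templates.append(tuple(p))          # recruit a new category
            chosen = len(templates) - 1
        else:
            # Fast learning: the template shrinks to the shared features.
            templates[chosen] = tuple(a & b for a, b in zip(p, templates[chosen]))
        labels.append(chosen)
    return labels, templates


b = (1, 1, 1, 0, 0, 0, 0, 0, 0, 0)
a = (1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
c = (0, 0, 0, 0, 0, 0, 1, 1, 1, 1)
coarse, _ = art1_categorize([b, a, c], vigilance=0.7)
fine, _ = art1_categorize([b, a, c], vigilance=0.9)
```

In this toy run the lower vigilance merges the first two patterns into one category, while the higher
vigilance separates them, which is the qualitative effect examined in Figure 6.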

3.5.2 Reinforcement-habit network 

Milner (Milner, 1963, 1964), using a modified Wisconsin card sorting test, showed that people with
frontal damage were incapable of shifting their decision-making criterion dynamically in the presence of
reward and punishment. Leven and Levine (Leven and Levine, 1987) modeled this scenario using a neural
network capable of replicating the behavior of both frontally damaged and normal humans, depending on
the gain constant a. Figure 4 shows the modified version of the Leven-Levine model. As discussed
previously, the modified model includes a decision layer which ascertains whether the category neurons
have come up with a decision regarding the category to which the input belongs. Figure 7 demonstrates
the case when a category choice is made for a given input. In this simulation the input consisted of two
red crosses (i.e., 0100 1000 0010). Categorization according to color had been rewarding in the previous
cases, so in this case the network should categorize according to color. The graphs in Figure 7 show the
activity of the four category neurons. As can be seen from the graphs, category neuron “1” (which
represents the one red triangle) has the highest activity, implying that the input card was matched
according to color.

[Figure 6 panels. Top-down weights (vigilance 0.95): 1. A, 2. C, 3. D, 4. E, I, 5. F, 6. J, 7. H.
Top-down weights (vigilance 0.93): 7. D, 8. K.]

Figure 6: The input comprises 8 letters from “a” to “k”, and the output gives the top-down map of the
classes into which the various inputs were categorized for vigilance parameters 0.95 and 0.93.

Figure 7: The four graphs show the responses of the four respective category nodes. The results shown
are for a simulation in which the input “response card” to the network was two red crosses. Since the
previous categorizations had been by color, the category neuron “1”, representing one red triangle, has
the maximum activity. Thus the input response card has been categorized “correctly” by color.

⁴ All the simulations were initially conducted on an Ultrix workstation. The X Window environment was
used for the interactive graphics in the simulations. All the code was written in C, and the operating
system used was Unix. The simulations were conducted in a continuous mode so as to model real-life
situations. The differential equations were solved using the Runge-Kutta-Fehlberg algorithm, which was
obtained from the Oak Ridge lab. These simulations were then ported to the Sparc Station at NASA-JSC, on
which they are running. A parallel implementation of these simulations was also conducted on the Amdahl
at NASA-JSC. Each of the simulations consists of a pair of windows called the input and the output
windows. They represent, respectively, the input to and the response of the modeled neural network.
Other windows represent the activity of the various neurons and the strengths of the signals. Once
invoked, the simulation runs continuously and can be interrupted at any time by typing Ctrl-C. On doing
this, a menu pops up listing the various options that can be chosen. The options range from input of a
new object to the network to quitting the entire simulation.

The second simulation (Figure 8) shows the situation in which categorization was made possible
by the ambiguity neuron. The response card with two red crosses is the input to the model. Initially,
categorization of the input was not possible because there was no information about any criterion being
more rewarding than the others. However, the initial indecision is resolved randomly by the ambiguity
neuron. As can be seen from the graph, the activity of the second decision neuron rises above the
threshold, implying that the model matched the input to the second category.

4 Novelty and Reinforcement 

4.1 Neurophysiological basis 

Novelty is yet another facet of our decision-making process. In general, new objects attract human
attention. However, such a new (novel) stimulus loses its attraction once it has been around long
enough. A baby who is shown a new toy is quickly attracted to it and often leaves what she is
holding or doing to acquire the new toy. Thus, the new toy (which is novel) is more attractive to her
than the other things already within her reach (environment). Now, if one were to show her
another new toy after a sufficient time, she would leave the former toy for the latter one, as the former
toy would have lost its novelty⁵. Novelty plays an important role in the exploratory behavior of humans
and is vital for learning new things.

Experiments conducted by Pribram on normal and frontally lesioned rhesus monkeys suggested
that the frontal lobes play a major role in novelty-based behavior (Pribram, 1961). One such experiment
consisted of a scene comprising a board with twelve holes. Two junk objects were placed on two
of these holes, and under one was placed a peanut. The task of the monkey was to open the object
under which the peanut was placed (a rewarding event). Once any one of the objects had been lifted by
the monkey, the wooden board was hidden from the monkey’s view, and another peanut was placed
under the object if the monkey had picked the object containing the peanut. After a certain number of
successful trials in which the monkey picked the object under which the peanut was kept, the reward
(i.e., the peanut) was moved to the second object. This process was continued until the monkey had
completed a fixed number of successful trials. After this, a new (i.e., third) junk object was placed in
the scene. The peanut was kept under this new object. Thereafter the number of junk objects was
increased after a fixed number of successful picks by the monkey, until all 12 holes of the wooden board
were covered.

⁵ Piaget’s treatise on cognitive behavior in children (Elkind and Flavell, 1969).

Figure 8: The four graphs shown in this figure represent the activities of the four decision layer
neurons. When a response card was initially input to the ART network, the network was not able to
categorize it. Eventually categorization was achieved randomly, as the input card was categorized into
category 3.

Figure 9: The experimental data shown above represent the number of repetitive trials required by
frontal monkeys (solid lines) and normal monkeys (dashed lines) to pick up a novel object in Pribram’s
experiment.

Figure 9 shows the results of Pribram’s experiment. The number of repetitive errors (lifting of
previously rewarding objects) made by the monkey until it picked the new object containing the reward is
plotted against the number of junk objects. In general, it can be seen that frontally damaged monkeys
make fewer errors than normals, being more attracted to novel objects⁶. Thus, Pribram concluded that in
normal monkeys the frontal cortex was responsible for the suppression of novelty-driven behavior. As in
the rhesus monkey, the human frontal lobes also seem to be responsible for gating novelty-based
cognitive decision making.

4.2 Significance 

Attraction to novelty is essential for exploration and learning new things. However, extreme attraction
to novelty in a changing environment can be disastrous, and it should be weighted by reinforcement
signals. Pribram’s experiments show that if this balance is disturbed (by damage to the frontal lobes
that weakens the effect of reinforcement signals), pathological behavior results. Again, comparison of
pathological and normal subjects reveals a “transparent” aspect of decision making.


4.3 Neural models 


Levine and Prueitt (Levine and Prueitt, 1989) modeled Pribram’s psychological findings with a neural
network comprising gated dipoles. Gated dipoles are devices that use transmitter mechanisms for
comparing present and past signal values. An example of a gated dipole is shown in Figure 10. The
square synapses in this figure signify such transmitter packages, and their activities are given by the
following equations.


$$\frac{dz_1}{dt} = a_1(\beta - z_1) - a_2 y_1 z_1 \qquad (14)$$

$$\frac{dz_2}{dt} = a_1(\beta - z_2) - a_2 y_2 z_2 \qquad (15)$$


where $z_1$ and $z_2$ are the amounts of available transmitter at the ON and OFF channels respectively,
$a_1$ is the rate of accumulation of the transmitter to the maximum value of $\beta$ (the total amount
of transmitter), and $a_2$ represents the depletion rate of the transmitter.

⁶ Details regarding the improved performance of normal monkeys in the case of a larger number of objects
and the worse performance of frontally damaged monkeys in the case of few objects have been discussed by
Levine and Prueitt and by Pribram (Levine and Prueitt, 1989; Pribram, 1961).

Figure 10: The schematic diagram of a gated dipole.

The activities of the input nodes $y_1$ and $y_2$ are given by


$$\frac{dy_1}{dt} = -g y_1 + I + J \qquad (16)$$

$$\frac{dy_2}{dt} = -g y_2 + I \qquad (17)$$


where $g$ is the decay rate, $I$ is the background arousal signal and $J$ is the input. The activities of
the remaining nodes $x_1$ through $x_5$ are given by


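$$\frac{dx_1}{dt} = -g x_1 + b\,y_1 z_1 \qquad (18)$$

$$\frac{dx_2}{dt} = -g x_2 + b\,y_2 z_2 \qquad (19)$$

$$\frac{dx_3}{dt} = -g x_3 + b\,[x_1 - x_2]^+ \qquad (20)$$

$$\frac{dx_4}{dt} = -g x_4 + b\,[x_2 - x_1]^+ \qquad (21)$$

$$\frac{dx_5}{dt} = -g x_5 + (1 - x_5)\,x_3 - x_5 x_4 \qquad (22)$$

where $b$ is a constant.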

Now consider the case when no input is applied. The amounts of transmitter $z_1$ and $z_2$ are identical,
hence the $x_5$ neuron is not active. When the input $J$ is turned “on”, the depletion of transmitter in
the left channel is greater than that in the right channel. Hence $z_1$ is less than $z_2$. However, the
greater input ($I + J$ in the left channel) overcomes the more depleted transmitter to make the left
channel more active. When the input $J$ is turned off, the right channel becomes transiently active and
inhibits the activity of the $x_5$ neuron until the transmitter of the left-hand synapse is completely
replenished. This causes the rebound of the $x_5$ activity⁷.
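
The ON response and the OFF rebound described above can be reproduced with a small forward-Euler
integration of equations (14) to (22). This is a sketch only: all parameter values, the step size, the
pulse timing and the function name `simulate_gated_dipole` are assumptions chosen for illustration, and
plain Euler stepping is used here for brevity rather than the Runge-Kutta-Fehlberg solver mentioned in
the footnotes.

```python
def simulate_gated_dipole(steps=15000, dt=0.01, on_step=1000, off_step=6000,
                          I=1.0, J_amp=1.0, g=1.0, b=1.0,
                          a1=0.02, a2=0.1, beta=1.0):
    """Euler-integrate equations (14)-(22); return one state dict per step."""
    # Start both channels at the resting equilibrium (arousal I only, J = 0).
    y1 = y2 = I / g
    z1 = z2 = a1 * beta / (a1 + a2 * y1)
    x1 = x2 = b * y1 * z1 / g
    x3 = x4 = x5 = 0.0
    trace = []
    for n in range(steps):
        J = J_amp if on_step <= n < off_step else 0.0   # rectangular input pulse
        dy1 = -g * y1 + I + J                            # (16)
        dy2 = -g * y2 + I                                # (17)
        dz1 = a1 * (beta - z1) - a2 * y1 * z1            # (14)
        dz2 = a1 * (beta - z2) - a2 * y2 * z2            # (15)
        dx1 = -g * x1 + b * y1 * z1                      # (18)
        dx2 = -g * x2 + b * y2 * z2                      # (19)
        dx3 = -g * x3 + b * max(x1 - x2, 0.0)            # (20), ON output
        dx4 = -g * x4 + b * max(x2 - x1, 0.0)            # (21), OFF output
        dx5 = -g * x5 + (1.0 - x5) * x3 - x5 * x4        # (22)
        y1 += dt * dy1
        y2 += dt * dy2
        z1 += dt * dz1
        z2 += dt * dz2
        x1 += dt * dx1
        x2 += dt * dx2
        x3 += dt * dx3
        x4 += dt * dx4
        x5 += dt * dx5
        trace.append({"z1": z1, "z2": z2, "x3": x3, "x4": x4, "x5": x5})
    return trace


trace = simulate_gated_dipole()
during, after, end = trace[1500], trace[6500], trace[-1]
```

With these assumed values the ON node $x_3$ is active while $J$ is on, the OFF node $x_4$ becomes active
only after $J$ is switched off, and both return to rest once the left-channel transmitter has recovered.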

Figure 11 shows the neural network proposed by Levine and Prueitt to explain Pribram’s psychological
results. This network comprises a number of gated dipoles coupled to each other (only two gated
dipoles are shown for simplicity, and the subscript $i$ denotes a specific gated dipole). Each gated
dipole corresponds to an object in the environment. The presence of an object in the environment is
indicated by the input $J$ (“on” implies the presence of an object and “off” implies its absence). The
activity of the $x_{5,i}$ neuron indicates the possibility that the model is choosing that particular
object. Thus the $x_{5,i}$ node with the highest activity corresponds to the object that the model has
chosen. A nonspecific arousal input $I$ is also given to all the gated dipoles in the network.

⁷ For a detailed discussion of the gated-dipole dynamics, refer to Grossberg (Grossberg, 1972).

Figure 11: The original model proposed by Levine and Prueitt.

The activity of the network is modulated by a reward signal via the reward node, whose differential
equation is

$$\frac{du}{dt} = -g u + r \qquad (23)$$


where $u$ is the reward node activity and $r$ is the reward signal. This reward node modulates the
activity of the gated dipole node $x_{5,i}$ as follows


$$\frac{dx_{5,i}}{dt} = -g x_{5,i} + (1 - x_{5,i})(e\,u\,w_{5,i} + x_{3,i}) - c\,x_{5,i}\Big(x_{4,i} + \sum_{j \neq i} x_{5,j}\Big) \qquad (24)$$


where one of the excitatory terms consists of the weighted activity of the reward node multiplied by a
coupling parameter $e$ (between the reward node and the gated dipole node $x_{5,i}$), and the other term
is the activity of its excitatory neuron $x_{3,i}$. The inhibitory term consists of two parts:
$\sum_{j \neq i} x_{5,j}$, which constitutes the lateral inhibition from the other gated dipoles, and the
inhibitory $x_{4,i}$ node of the gated dipole. The behavior of normal monkeys is modeled by a large value
of the coupling factor $e$, and a small value of the coupling factor models the behavior of frontally
lesioned monkeys. In the case of normal monkeys, the reward signals contribute significantly to the
activity of the respective $x_{5,i}$ neuron. The effects of the reward signals are stored in long-term
memory via modification of the weights connecting the reward node and the respective $x_{5,i}$ neuron.
The coding of reward signals into weights is given by

$$\frac{dw_{5,i}}{dt} = -f_1 w_{5,i} + f_2\,u\,x_{5,i} \qquad (25)$$

where $w_{5,i}$ is the synaptic strength between the reward node $u$ and $x_{5,i}$, and $f_1$ and $f_2$
are positive constants.

Thus, for low values of the coupling parameter $e$, the novelty of new objects plays the major role;
however, for large values of $e$, previously rewarding objects can override the attraction of novel
objects.
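
The effect of the coupling parameter can be illustrated by integrating equation (24) for two competing
dipoles whose other signals are frozen. This is a sketch under assumptions: the function name
`choose_object` and all numerical values are hypothetical. Object 0 stands for a familiar, previously
rewarded object (small $x_3$, large learned weight), object 1 for a novel, unrewarded object (larger
$x_3$, zero weight), and $u$, $x_3$, $x_4$ and the weights are held constant rather than evolving.

```python
def choose_object(e, x3=(0.05, 0.30), x4=(0.0, 0.0), w=(2.0, 0.0),
                  u=1.0, g=1.0, c=1.0, dt=0.01, steps=5000):
    """Euler-integrate equation (24) for two dipoles; return (winner, x5)."""
    x5 = [0.0, 0.0]
    for _ in range(steps):
        new = []
        for i in (0, 1):
            j = 1 - i                               # index of the competing dipole
            excite = e * u * w[i] + x3[i]           # reward term plus ON-channel input
            inhibit = x4[i] + x5[j]                 # OFF channel plus lateral inhibition
            dx = -g * x5[i] + (1.0 - x5[i]) * excite - c * x5[i] * inhibit
            new.append(x5[i] + dt * dx)
        x5 = new
    winner = 0 if x5[0] > x5[1] else 1
    return winner, x5


normal_choice, _ = choose_object(e=1.0)     # large coupling: "normal" behavior
frontal_choice, _ = choose_object(e=0.01)   # small coupling: "frontally lesioned" behavior
```

Under these assumed inputs the large coupling selects the previously rewarded object, while the small
coupling selects the novel one.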

4.4 Modifications 

The above model was modified and used in the cognitive unit of the self-organizing robot. The original
gated-dipole network was replaced by a simpler version shown in Figure 12. The “ON” channel of this
gated dipole comprises the transmitter node $z_1$ and the node $x_1$. The “OFF” channel comprises
the transmitter node $z_2$ and the node $x_2$. The gated-dipole node $x_3$ (similar to the $x_5$ node in
the original gated-dipole model) combines the inputs of both the “ON” and “OFF” channels. Using this
modified (simple) gated-dipole network, a modified version of the Levine-Prueitt model was developed.
This modified model is shown in Figure 13.

Figure 12: Schematic of the modified gated dipole network.

Figure 13: The schematic representation of the modified Levine-Prueitt model.

In the Levine-Prueitt model, although the effect of a reward could decay through the extinction
mechanism in learning, there were no mechanisms to accommodate punishment effects. Though a negative
reward input may seem to enable the model to incorporate punishment by using the same connection,
a closer examination reveals that such a strategy would create an ambiguity due to the asymmetry of
reward and punishment signals. Assume that a synaptic weight has been increased due to repeated
application of reward signals for a particular object. Such a reinforcement would indirectly increase
the effect of a punishment input as well. The single-LTM scheme cannot independently store reward and
punishment signals in LTM and thus creates an ambiguity. In order to achieve independent LTM
storage, we introduce separate LTM traces using separate reward and punishment synapses. However,
this solution implies a subtle asymmetry in encoding reward and punishment by Hebbian
synapses. A reward signal causes an increase in the post-synaptic neuron activity, and thus the LTM
trace correlates increases in pre- and post-synaptic activities and reinforces this increase. An
asymmetry arises in punishment: a punishment signal depresses the post-synaptic activity. The LTM trace
of punishment correlates increasing pre-synaptic activity with decreasing post-synaptic activity. In a
Hebbian synapse this asymmetry causes a major problem, since the inhibiting synapse cannot effectively
code the STM traces. Moreover, such a coding would also be relatively insensitive to punishment with
respect to reward (which is not desirable). To solve this problem, we introduced a new type of learning
rule. Both the reward and punishment nodes have the same dynamics:

$$\frac{dr}{dt} = -A_1 r + (B_1 - r)R + C_1 \qquad (26)$$

$$\frac{dp}{dt} = -A_1 p + (B_1 - p)P + C_1 \qquad (27)$$

where $A_1$, $B_1$ and $C_1$ are constants, and $R$ and $P$ are the external reward and punishment
signals respectively. The reward and punishment node dynamics have been modified to shunting-type
equations. Shunting equations were used so as to be able to dissociate learning rates from forgetting
rates (Grossberg, 1988). While the learning equations for the reward weights remain the same, the
learning equations for the punishment weights have been modified to include an auxiliary variable $y_i$.
The activity of this variable is as follows.

$$\frac{dy_i}{dt} = -A_3 y_i + (B_3 - y_i)\,f(x_{3,i} - \theta) \qquad (28)$$

The synaptic modification equation is given by:

$$\frac{dw_{p,i}}{dt} = -A_2 (w_{p,i} - 1) + (M_p - w_{p,i})\,B_2\,g(p - \theta_1)\,g(y_i - \theta_2) \qquad (29)$$

where $\gamma$, $A_3$, $A_2$, $B_3$, $B_2$, $\theta$, $\theta_1$, $\theta_2$ and $M_r$ are constants and
$g$ is a linear above-threshold function. $y_i$ reflects past values of the postsynaptic neuron, to
enable the association of the current punishment signal with the past activities of decisions.


Another problem of the original model was that an external monitor (experimenter) was needed to
ascertain whether the network chose a particular object from the environment. Such an implementation
is not possible in a continuous environment, and hence an additional network capable of handling this
situation was implemented using a feed-forward shunting network, as shown in Figure 14. This network
receives inputs from the gated dipole circuit and, when only one of the nodes is active, sends a signal
to the robotic arm. (This network is similar to the decision layer discussed in the modified
Leven-Levine model.)

The differential equation of the gated dipole node $x_{3,i}$ (the node $x_5$ in the original model) was
also modified to incorporate the various changes. The modified equation is as follows.


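$$\frac{dx_{3,i}}{dt} = -A x_{3,i} + (B - x_{3,i})\big(x_{1,i} + G_1 w_{r,i}\,r + G_3 f(x_{3,i} - \theta)\big) - x_{3,i}\Big(x_{2,i} + G_2 w_{p,i}\,p + H \sum_{j \neq i} f(x_{3,j} - \theta)\Big) \qquad (30)$$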


where the excitation terms consist of the weighted reward signal multiplied by the gain $G_1$, the
excitatory input from the gated dipole, and a self-feedback term $G_3 f(x_{3,i} - \theta)$. The
inhibitory terms consist of the weighted punishment signal $G_2 w_{p,i}\,p$, the inhibitory input from
the gated dipole $x_{2,i}$, and the lateral inhibitory term $f(x_{3,j} - \theta)$. The function $f$ is a
sigmoid function and is given as


$$f(x) = \frac{x^5}{1 + x^5} \qquad (31)$$

The differential equations representing the reward and punishment node activities were changed from
the additive type to the shunting type. This change causes the onset of the punishment and reward
signals (i.e., the rise time of the reward/punishment node) to be much faster than the recovery from
the signal (i.e., the decay time of the reward/punishment node).
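
The difference between rise and decay times follows from the shunting form: while the signal $R$ is on,
the node approaches its asymptote at rate $A_1 + R$, whereas after the signal ends it relaxes at rate
$A_1$ only. The sketch below checks this numerically for the reward node. The parameter values, the 90%
and 10% criteria and the function name `reward_node_times` are assumptions made for illustration; only
the pulse length of 100 time units follows the simulations reported in the text.

```python
def reward_node_times(A1=0.01, B1=1.0, C1=0.0, R_amp=1.0,
                      pulse=100.0, dt=0.01, t_end=600.0):
    """Euler-integrate dr/dt = -A1*r + (B1 - r)*R + C1 for one reward pulse.

    Returns (rise_time, decay_time): time from pulse onset until r reaches 90%
    of its value at the end of the pulse, and time from pulse offset until r
    has fallen to 10% of that value.
    """
    n_pulse = int(round(pulse / dt))
    n_total = int(round(t_end / dt))
    r, trace = 0.0, []
    for n in range(n_total):
        R = R_amp if n < n_pulse else 0.0     # rectangular reward pulse
        r += dt * (-A1 * r + (B1 - r) * R + C1)
        trace.append(r)
    peak = trace[n_pulse - 1]
    rise_step = next(n for n in range(n_pulse) if trace[n] >= 0.9 * peak)
    decay_step = next(n for n in range(n_pulse, n_total) if trace[n] <= 0.1 * peak)
    return (rise_step + 1) * dt, (decay_step + 1 - n_pulse) * dt


rise_time, decay_time = reward_node_times()
```

With these assumed constants the rise takes a few time units while the decay takes a few hundred, so a
brief reinforcement pulse leaves a long-lasting trace in the node activity.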

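Equations representing the activity of the remaining modified gated dipole nodes are given below.

$$\frac{dx_{1,i}}{dt} = (B - x_{1,i})(I + J_i)\,z_{1,i} - x_{1,i}\,I\,z_{2,i} \qquad (32)$$

$$\frac{dx_{2,i}}{dt} = (B - x_{2,i})\,I\,z_{2,i} - x_{2,i}(I + J_i)\,z_{1,i} \qquad (33)$$

$$\frac{dz_{1,i}}{dt} = a(\beta - z_{1,i}) - \gamma (I + J_i)\,z_{1,i} \qquad (34)$$

$$\frac{dz_{2,i}}{dt} = a(\beta - z_{2,i}) - \gamma I z_{2,i} \qquad (35)$$

$$\frac{dw_{r,i}}{dt} = -A_2 (w_{r,i} - 1) + (M_r - w_{r,i})\,B_2\,g(r - \theta_1)\,g(x_{3,i} - \theta) \qquad (36)$$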


This concludes the modifications made to the original Levine-Prueitt model.

4.5 Simulation Results

4.5.1 Simulations on Gated Dipole 

Figure 15a shows a gated dipole neural network. As can be seen from the figure, there are two competing
channels. A nonspecific arousal input “I” is given to both channels. When the input “J” to the
gated dipole turns on, the greater input to the left channel causes $x_3$ to be excited. Turning off
the input J causes a rebound activity in the right channel. The graphs showing the input J and the
output activity of the $x_3$ neuron are given in Figure 15b.

Each gated dipole in the modified model represents an object in the scene. Turning “on” a
particular $J_i$ (i.e., the input to a gated dipole) implies that a new object represented by that gated
dipole has been introduced into the scene. Figure 16a shows the coupling of two gated dipoles in a
lateral inhibitory fashion. In the simulation (Figure 16b), one object is input (i.e., $J_1$ is turned
“on”) and after a while a second object is introduced (i.e., $J_2$ is turned on) into the environment.
The output graph indicates the activity of the $x_{3,i}$ node of each gated dipole, which represents the
choice made by the model. The gated dipole node $x_{3,i}$ with the highest activity indicates the object
chosen by the model. The input graphs show when a particular object was introduced into the scene. As
can be seen from the graph, when the first object was input, the activity of the $x_3$ neuron of the
first gated dipole increased, causing a drop in the activity of the $x_3$ node of the second gated
dipole, thus suggesting that the model chose the first object. When the second object was input, the
activity of the $x_3$ node of the second gated dipole increased, causing the activity of the $x_3$ node
of the first gated dipole to drop. The greater activity of the second gated dipole implied that the
model had chosen the new object over the other. Thus this simulation demonstrates a competitive
situation between input objects in which the most recently “input” object was chosen.

4.5.2 Simulations on the new modified model

Figure 17 illustrates three simulations conducted on the new modified model shown in Figure 13.
Simulation results shown are for a model comprising two interacting gated dipoles coupled to reward
and punishment nodes. As in the above simulation, each of the gated dipoles represents an object
in the scene. For each pair of graphs shown, the graph on the left represents the activity of the $x_3$
node of the first gated dipole and the graph on the right represents the activity of the $x_3$ node of
the second gated dipole. The gated dipole with the highest activity indicates the object chosen
by the model. The first pair of graphs illustrates how the model handles novel objects. Initially, at
100 time units⁸, an object is introduced. This causes the first gated dipole’s activity to rise above
threshold, implying that the model selected this new object. At 800 time units a second object is
introduced. By this time the first object has lost its novelty. This leads to the activation of the
second gated dipole, which in turn overrides the activity of the first gated dipole. Thus the higher
activity of the second gated dipole indicates that the model chose the second object (note that this
occurred when no reinforcement, either positive or negative, was applied). This simulation shows how the
most recent input is more attractive than previously unrewarded inputs (novelty).

⁸ The units of time are not crucial; however, the relative units of time between different time
constants are important. Hence we have refrained from using any specific time units (e.g., secs, msecs,
etc.) and just refer to them as “time units”.

Figure 15: (a) A simplified gated-dipole network and (b) the dynamics of such a network for a
rectangular pulse.

Figure 17: The three graphs show the simulations involving novelty, reward and punishment. Refer to the
text for details.

The second pair of graphs demonstrates how previously rewarded objects are more attractive than
recently input objects. The simulation begins as in the previous case, i.e., an object is input to the
scene at 100 time units. However, at 500 and 750 time units the choice of the first object by the model
is rewarded. Now when the second object is input at 1000 time units, it fails to drive the second gated
dipole above the activity of the first gated dipole, implying that the model preferred the first object
over the second (novel) object.

The third pair of graphs is a further extension of the previous simulation to include punishment
signals. The simulation is similar to the previous case until 1200 time units. At 1200 time units a
punishment signal is given for the choice made by the model. This lowers the activity of the first gated
dipole, and the second gated dipole response rises above that of the first gated dipole. Thus punishing
the initial choice made by the model causes it to change its choice. In the above cases the punishment
and reward signals were brief pulses which lasted for 100 time units.

5 The combined model 

We will now describe the combination of these modules with a vision module. For convenience of
presentation we will consider a general scenario dealing with the initiation of exploratory behavior.
Such a scenario (Figure 18) is given below. A vision unit, represented by a camera, looks
over a two-dimensional terrain; a robotic arm performs certain actions on this terrain; and a cognitive
unit coordinates the camera and the robotic arm. The camera concentrates on a particular region and
identifies a given object in its region of focus. The cognitive unit categorizes various input objects
into a hierarchy of categories and makes behavioral decisions regarding the object in focus. These
behavioral decisions are modulated by novelty, habits and external reinforcement signals. Behavioral
decision units interact with object-representation units, and a signal is sent to the arm to “reach” for
the object whenever appropriate. A schematic view of the various processes involved and the interactions
between them is given in Figure 19a. Figure 19b gives a more detailed block diagram of this system. All
further discussions relate to Figure 19. We begin with the vision unit.

Figure 18: The general scenario for the combined system.

The general field of view of the camera consists of a topographic array⁹. The neural network for this
array is a simplified and modified version of the network proposed in (Ogmen and Gange 1990a, 1990b;
Ogmen and Moussa, in print). It comprises a gated-dipole structure that senses the temporal novelty of
the inputs¹⁰. This early visual network is shown in Figure 20.

The camera movements are controlled by a competitive layer and a buffering layer, as shown in Figure
21. The CL is a recurrent competitive layer. The output of this layer is buffered by a feedforward
shunting network (BL), which guarantees that the movements of the eye will not be affected by the
transients in the decision layer network (CL). However, such a recurrent competitive layer also has
the property of hysteresis, i.e., the winning neuron will tend to persist and thus will prevent
subsequent eye movements, thereby making the system unable to generate continuous eye movements. To
generate continuous eye movements, the winning neuron is inhibited by recurrent feedback coming from the
FL and DL layers. The FL layer combines sensory and cognitive signals to regulate the eye movement,
and the DL network adjusts the time course of the feedback inhibition to generate a new target position
for the eye. Thus this network is responsible for moving the camera to the region of focus. The region
of focus is very similar to the fovea in humans; it is the region of high acuity. We have not introduced
size and rotation invariance (via, e.g., a log-polar transform) because we will test the hypothesis that
the invariances are learned through mental internalization of sensory-motor behaviors and emerge from
exploratory activity.

The foveal input is fed to the categorization networks. ART networks, as shown in Figure 22, are
chosen to ensure stable categories in non-stationary environments. The ART network at the bottom
categorizes the inputs into different object types. The other ART network categorizes the inputs
according to categories that are related to the task considered in the present scenario (e.g., targeted
object, junk object). The categorization of the latter ART network is modulated by bias nodes. The bias
nodes, as discussed before, encode the environmental cues which modulate the categorization. The outputs
of these categorization networks are fed to another network, which combines object novelty and category
information to produce a motor behavior. This is shown in Figure 23.

Object types are fed into a gated dipole structure to sense the “object novelty”. Slow interneurons
are used to integrate transient signals generated by rapid eye movements. The outputs of these gated
dipoles are fed into a competitive layer with recurrent on-center off-surround connections. This layer
also receives outputs from the task-related category neurons. The competition generates a winner for the
target arm position. Again, this competitive layer is buffered by a feed-forward on-center off-surround
network to filter the transients. This buffering layer is gated by an eye position signal to transfer
eye position to arm position via an eye-hand coordination network. Thus the combined model would be
capable of dynamically locating objects present in its environment and of issuing a signal to the robot
arm to pick up the object if it is a “targeted object”.

⁹ This represents the retinotopic organization of the visual space.

¹⁰ We will refer to this novelty as spatial novelty from here on, to differentiate it from object
novelty, which will be discussed later.

Figure 21: The decision layer that brings the camera “in focus” on a particular object.

Figure 22: The categorization unit comprising two ART networks. The first categorization is based
on the features of the objects. The second categorization is based on the behavioral significance of the
objects.

Figure 23: The object-representation unit. The inputs J to the gated dipole come from the feature
categorization layer of the categorization unit. The inputs from the behavioral categorization layer
modulate the activity of the recurrent layer neurons in this unit.

5.1 Ambiguity in decision making 

In this section we discuss two combined networks that resolve decisional ambiguities. The first
combined architecture is shown in Figure 24. As discussed before, input features of the object in focus
are fed to the ART networks. Initially, when an object is present in the environment, no behavioral
significance is attached to it, and an ambiguity exists as to whether the object should be classified
as a targeted or a non-targeted object. The architecture of Figure 24 resolves this ambiguity by biasing
the targeted category via the ambiguity neuron. This “positive bias” is used whenever there is
insufficient evidence to reach a decision. On the other hand, such a bias forces the system to make a
decision for all inputs. A consequence of this is that novelty plays no role in the decision making,
since it is always overridden by the task-related decision.

A modification to the above architecture is presented in Figure 25. Note that the inputs that drive
the ambiguity neuron in this case come from the decision layer neurons of the object representation
network. This modification enables the novelty of the object to play a part in the behavioral decision
making process. Now consider the same situation as before, where an ambiguous object is placed
in the environment. In the present case, when a behavioral indecision occurs, the competitive layer
feeding the motor circuits receives its major inputs from the gated dipoles representing object
novelty. This signal in turn drives the ambiguity neuron, which biases the “targeted” category
neuron, thus proclaiming the object in focus to be a targeted object. Thus in this modified combined
architecture novelty plays an important role in cognitive decision making.

In the next subsection we present the differential equations that describe the combined model.

5.2 Combined neural model 

As mentioned above, the combined model comprises three major networks: the visual scanning network,
the cognitive network and the object representation network. The two models discussed above
differ only in the implementation of the ambiguity neuron. Hence we first discuss the differential
equations common to both models and finally discuss the ambiguity neuron equation for each case.

Figure 24: The detailed architecture for the combined model.

Figure 25: The modified architecture for the combined model.

The visual scanning network is implemented by an array of gated dipoles described by

$$\frac{d\,vx_{1,i}}{dt} = -A\,vx_{1,i} + (B - vx_{1,i})(I + J_i)\,vz_{1,i} - vx_{1,i}\,I\,vz_{2,i} \tag{37}$$

$$\frac{d\,vx_{2,i}}{dt} = -A\,vx_{2,i} + (B - vx_{2,i})\,I\,vz_{2,i} - vx_{2,i}(I + J_i)\,vz_{1,i} \tag{38}$$

$$\frac{d\,vx_{3,i}}{dt} = -A\,vx_{3,i} + (B - vx_{3,i})\big(vx_{1,i} + vx_{2,i} + G_3 f(vx_{3,i} - \theta)\big) - vx_{3,i}\Big(M\,l_i + H \sum_{j \neq i} f(vx_{3,j} - \theta)\Big) \tag{39}$$

$$\frac{d\,vz_{1,i}}{dt} = \alpha(\beta - vz_{1,i}) - \gamma (I + J_i)\,vz_{1,i} \tag{40}$$

$$\frac{d\,vz_{2,i}}{dt} = \alpha(\beta - vz_{2,i}) - \gamma I\,vz_{2,i} \tag{41}$$

where $vx_{1,i}$, $vx_{2,i}$, $vx_{3,i}$, $vz_{1,i}$ and $vz_{2,i}$ represent the activities of the
$x_1$, $x_2$, $x_3$, $z_1$ and $z_2$ neurons of the modified gated dipole architecture discussed in
section 4.3.
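
As an illustration of how a gated dipole responds to the onset and offset of an input, the following
Python sketch integrates a single dipole of the form (37), (38), (40) and (41) with the forward Euler
method. The recurrent layer (39) is left out. The parameter values, the step size, the rectangular
input pulse J and the function name simulate_gated_dipole are assumptions made for this example; they
are not the values used in the simulations of this report.

```python
import numpy as np


def simulate_gated_dipole(T=60.0, dt=0.01, t_on=10.0, t_off=30.0,
                          A=1.0, B=1.0, I=0.5, J_amp=1.0,
                          alpha=0.05, beta=1.0, gamma=0.5):
    """Forward-Euler integration of one gated dipole (assumed parameters).

    Returns the time axis and the on-channel (x1) and off-channel (x2) traces.
    """
    n = int(round(T / dt))
    t = np.arange(n) * dt
    # Start both transmitters at their resting level for the tonic input I alone.
    z1 = z2 = alpha * beta / (alpha + gamma * I)
    x1 = x2 = 0.0
    x1_trace = np.empty(n)
    x2_trace = np.empty(n)
    for k in range(n):
        # Phasic input J is a rectangular pulse (an assumption of this sketch).
        J = J_amp if t_on <= t[k] < t_off else 0.0
        on_signal = (I + J) * z1   # transmitter-gated signal in the on channel
        off_signal = I * z2        # transmitter-gated signal in the off channel
        dx1 = -A * x1 + (B - x1) * on_signal - x1 * off_signal
        dx2 = -A * x2 + (B - x2) * off_signal - x2 * on_signal
        dz1 = alpha * (beta - z1) - gamma * (I + J) * z1
        dz2 = alpha * (beta - z2) - gamma * I * z2
        x1 += dt * dx1
        x2 += dt * dx2
        z1 += dt * dz1
        z2 += dt * dz2
        x1_trace[k] = x1
        x2_trace[k] = x2
    return t, x1_trace, x2_trace


if __name__ == "__main__":
    t, x1, x2 = simulate_gated_dipole()
    print("late in the pulse (x1, x2):", x1[2900], x2[2900])
    print("after the offset  (x1, x2):", x1[3200], x2[3200])
```

With these assumed values the on channel x1 exceeds the off channel x2 while J is present. For a short
period after J is removed the order is reversed, because the on-channel transmitter is still depleted
while the off-channel transmitter is not. This antagonistic rebound is the property that the gated
dipole uses to signal a change in its input.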

The gated dipole mechanism encodes novelty, and the decision for the new eye position is carried out
by the recurrent competitive layer (i.e. the $vx_{3,i}$ neurons). Once attentional gating of the input
yields a behavioral action, an inhibiting signal is issued by the feedback circuit. This feedback
inhibition is represented by the $l_i$ neurons. The competitive layer is buffered using a feedforward
network so that transient fluctuations during competition are avoided. This buffering layer, known as
the decision layer, provides a signal when a decision has been reached by the recurrent layer. The
output of this decision layer is used to control the eye movement. The decision layer performs the
function of moving the eye to a particular object and masking the peripheral view while the objects
are categorized. The differential equation for this feedforward shunting decision layer is given below.


$$\frac{dp_{1,i}}{dt} = -A\,p_{1,i} + (B - p_{1,i})\,W\,vx_{3,i} - p_{1,i}\,W \sum_{j \neq i} vx_{3,j}, \quad i, j = 1, 2, \ldots, 15. \tag{42}$$
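
The steady state of the decision layer helps to interpret its output as a share of votes. Assuming
that the off-surround sum in (42) runs over j ≠ i, setting the derivative to zero gives
p_i = B W x_i / (A + W Σ_j x_j), where x_i stands for $vx_{3,i}$ and the sum now runs over all j. For
small A this is B times the share of the total activity held by neuron i. The Python sketch below
checks this closed form against a forward Euler integration. The function names, the parameter values
and the example inputs are assumptions chosen for illustration only.

```python
import numpy as np


def decision_layer_steady_state(x, A=0.01, B=1.0, W=1.0):
    """Closed-form steady state of the feedforward shunting layer."""
    x = np.asarray(x, dtype=float)
    return B * W * x / (A + W * x.sum())


def integrate_decision_layer(x, A=0.01, B=1.0, W=1.0, dt=0.01, steps=5000):
    """Forward-Euler integration of the same layer, starting from rest."""
    x = np.asarray(x, dtype=float)
    p = np.zeros_like(x)
    surround = W * (x.sum() - x)   # off-surround input: sum over j != i
    for _ in range(steps):
        p += dt * (-A * p + (B - p) * W * x - p * surround)
    return p


def thresholded_decision(p, threshold=0.9):
    """Index of the first neuron above threshold, or None if there is none."""
    above = np.flatnonzero(np.asarray(p) > threshold)
    return int(above[0]) if above.size else None


if __name__ == "__main__":
    clear = [0.95, 0.03, 0.02]      # one neuron holds most of the activity
    contested = [0.6, 0.3, 0.1]     # competition not yet resolved
    print(thresholded_decision(decision_layer_steady_state(clear)))
    print(thresholded_decision(decision_layer_steady_state(contested)))
```

In this sketch a threshold of 0.9 with B = 1 lets the layer signal a decision only when one neuron of
the recurrent layer holds roughly 90% of the total activity; otherwise no output crosses the threshold.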


The output of this layer is thresholded. The value of the threshold reflects the certainty in the
decision (e.g. with B = 1, 0.9 corresponds to 90% of votes). The behavioral and the object
categorization ART networks categorize the attentionally gated visual input, and the equations for the
respective networks are given below. The differential equations for the behavioral ART are as follows.


$$\frac{d\,bx_i}{dt} = -A\,bx_i + (B - C\,bx_i)\Big(I_i + \sum_{j=1}^{2} f(by_j)\,bz_{j,i}\Big) - D\,bx_i \sum_{j=1}^{2} f(by_j), \quad i = 1, 2, 3, 4. \tag{43}$$

$$\frac{d\,by_j}{dt} = -A\,by_j + (B - C\,by_j)\Big(f(by_j) + \sum_{i=1}^{4} f(bx_i)\,bz_{i,j}\Big) - D\,by_j\Big(\sum_{r \neq j} f(by_r) + \Gamma\Big), \quad j = 1, 2. \tag{44}$$

And the differential equations for the object categorization network are as follows. 

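$$\frac{d\,fx_i}{dt} = -A\,fx_i + (B - C\,fx_i)\Big(I_i + \sum_{j=1}^{4} f(fy_j)\,fz_{j,i}\Big) - D\,fx_i \sum_{j=1}^{4} f(fy_j), \quad i = 1, 2, 3, 4. \tag{45}$$

$$\frac{d\,fy_j}{dt} = -A\,fy_j + (B - C\,fy_j)\Big(f(fy_j) + \sum_{i=1}^{4} f(fx_i)\,fz_{i,j}\Big) - D\,fy_j\Big(\sum_{r \neq j} f(fy_r) + \Gamma\Big), \quad j = 1, 2, 3, 4. \tag{46}$$

where $bx_i$, $by_j$ and $fx_i$, $fy_j$ are the feature and category neuron activities for the
behavioral and object categorization networks respectively.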

The transients that arise during competition in the F2 layers of the object and behavioral
categorization networks are also buffered using decision layers. The differential equations for these
feedforward shunting decision layers are as follows.

$$\frac{dp_{2,i}}{dt} = -A\,p_{2,i} + (B - p_{2,i})\,W\,by_i - p_{2,i}\,W \sum_{j \neq i} by_j, \quad i, j = 1, 2. \tag{47}$$

$$\frac{dp_{3,i}}{dt} = -A\,p_{3,i} + (B - p_{3,i})\,W\,fy_i - p_{3,i}\,W \sum_{j \neq i} fy_j, \quad i, j = 1, 2, 3, 4. \tag{48}$$


Internal biases (due to habit) and external biases (due to reinforcement) that influence the
categorization of the behavioral ART network are incorporated in the bias nodes of the cognitive
system. The differential equations for the bias and the habit nodes are given below.

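$$\frac{dn_k}{dt} = \Big\{\cdots - n_k\Big(a R^- + G \sum_{r \neq k} g(n_r)\Big)\Big\}\, f(\Phi_k), \quad k = 1, 2. \tag{49}$$

$$\frac{dh_k}{dt} = H\,h_k \Big\{(J - h_k)\,[\Phi_k - \theta_2]^+ - [\Phi_k - \theta_2]^-\Big\}, \quad k = 1, 2. \tag{50}$$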

where $n_k$ and $h_k$ represent the activities of the bias and habit neurons respectively. $R^+$ and
$R^-$ are the positive and negative environmental cues and $\Phi_k$ is the match signal, similar to
that discussed in section 3.3.

Once a relevant behavioral categorization has been achieved the feedback neuron is inhibited, which
also signals the eye to initiate scanning. Scanning is thus achieved by inhibiting the currently active
neuron so that other neurons have the possibility to win the competition. The eye then moves to the
object present at the winning location. The scanning is implemented by the following two layers of
neurons and the feedback neuron. The differential equations for this system are as follows.

$$\frac{df}{dt} = -A_s f + (B - f)\,\mathrm{Arousal} - \sum_{j} \cdots, \quad j = 1, 2. \tag{51}$$

$$\frac{ds_i}{dt} = -A_s s_i + (B - s_i)\,H_i\,g(f - \theta_2), \quad i = 1, 2, \ldots, 15. \tag{52}$$

$$\frac{dl_i}{dt} = -A_l l_i + (B - l_i)\,G_2\,g(s_i - \theta_2), \quad i = 1, 2, \ldots, 15. \tag{53}$$

where $f$ represents the activity of the feedback neuron, $s_i$ represents the activity of neurons in
the feedback layer and $l_i$ represents the activity of neurons in the delay layer, which provides
appropriate synchrony of the feedback signals.

Behavioral decision influences the object representation network, which in turn signals the robot
arm whether or not to pick up the currently focused object. When no definite behavioral categorization
is achieved, the novelty of object classes plays an important role. The object representation system
also comprises a gated dipole network, so as to model the novelty associated with the various classes
of objects present in the environment. The differential equations for the object representation system
are given below.

$$\frac{d\,cx_{1,i}}{dt} = -A\,cx_{1,i} + (B - cx_{1,i})(I + q_i)\,cz_{1,i} - cx_{1,i}\,I\,cz_{2,i} \tag{54}$$

$$\frac{d\,cx_{2,i}}{dt} = -A\,cx_{2,i} + (B - cx_{2,i})\,I\,cz_{2,i} - cx_{2,i}(I + q_i)\,cz_{1,i} \tag{55}$$

$$\frac{d\,cx_{3,i}}{dt} = -A\,cx_{3,i} + (B - cx_{3,i})\big(cx_{1,i} + G_3 f(cx_{3,i} - \theta) + G_3\,p_{2,1}\big) - cx_{3,i}\Big(cx_{2,i} + \sum_{j \neq i} f(cx_{3,j} - \theta) + G_3\,p_{2,2}\Big) \tag{56}$$

$$\frac{d\,cz_{1,i}}{dt} = \alpha(\beta - cz_{1,i}) - \gamma (I + J_i)\,cz_{1,i} \tag{57}$$

$$\frac{d\,cz_{2,i}}{dt} = \alpha(\beta - cz_{2,i}) - \gamma I\,cz_{2,i} \tag{58}$$

$$\frac{dq_i}{dt} = -A_q q_i + (B - q_i)\,p_{3,i} \tag{59}$$

where, similar to the previous case, $cx_{1,i}$, $cx_{2,i}$, $cx_{3,i}$, $cz_{1,i}$ and $cz_{2,i}$
represent the activities of the various neurons of the modified gated dipole network. $q_i$ represents
the activity of the slow neuron which encodes the various object types present in the environment.
This neuron is also involved in buffering the transients produced by the rapid scanning of the field
of view by the robot.
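
The buffering role of the slow neuron can be illustrated with a short Python sketch of (59). Here the
input $p_{3,i}$ is replaced by a square wave that switches between 0 and 1, a crude stand-in for a
category signal that is interrupted by rapid scanning. The square wave, the decay rate A_q = 0.05, the
step size and the function name simulate_slow_neuron are assumptions made for this example.

```python
import numpy as np


def simulate_slow_neuron(T=40.0, dt=0.01, period=2.0, A_q=0.05, B=1.0):
    """Forward-Euler integration of dq/dt = -A_q q + (B - q) p3 (assumed input)."""
    n = int(round(T / dt))
    t = np.arange(n) * dt
    # Assumed input: p3 is 1 during the first half of each period, 0 otherwise.
    p3 = ((t % period) < period / 2).astype(float)
    q = np.empty(n)
    q_now = 0.0
    for k in range(n):
        q_now += dt * (-A_q * q_now + (B - q_now) * p3[k])
        q[k] = q_now
    return t, p3, q


if __name__ == "__main__":
    t, p3, q = simulate_slow_neuron()
    late = q[t > 20.0]
    print("input swings between", p3.min(), "and", p3.max())
    print("slow neuron stays between", late.min(), "and", late.max())
```

Because the neuron charges at a rate of about A_q + p3 but discharges only at the rate A_q, a small A_q
makes it hold its activity across the gaps in the input. In this sketch the input swings over its full
range while q, once charged, varies only slightly.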

The recurrent layer in the object representation system also gives rise to transients, which need to
be buffered before they are used to generate arm signals. If this is not done, the arm moves around
erratically during the decision process. Hence the output of the buffered decision layer is used to
generate the arm signal. The differential equation for this shunting feedforward decision layer is as
follows.


$$\frac{dp_{4,i}}{dt} = -A\,p_{4,i} + (B - p_{4,i})\,W\,cx_{3,i} - p_{4,i}\,W \sum_{j \neq i} cx_{3,j}, \quad i, j = 1, 2, \ldots, 4. \tag{60}$$

Sometimes a behavioral decision cannot be made regarding the object currently in focus. This can be
due to conflicting environmental cues or to insufficient knowledge. As discussed in the previous
section, this ambiguity can be resolved in two different ways depending on the behavioral significance.

$$\frac{da}{dt} = -A\,a + (B - a) \sum_{j} bx_j - a \sum_{j} p_{2,j}, \quad j = 1, 2. \tag{61}$$

$$\frac{da}{dt} = -A\,a + (B - a) \sum_{k} p_{4,k} - a \sum_{j} p_{2,j}, \quad j = 1, 2, \quad k = 1, 2, 3, 4. \tag{62}$$

where $a$ is the activity of the ambiguity neuron, and $p_{2,j}$ and $p_{4,k}$ are the decision layer
neurons of the behavioral categorization and object representation networks.

The first of the above two equations represents the behavioral scenario where an object in focus is 
deemed as good if no behavioral categorization could be achieved, thus initiating the arm to go and 
pick the object. The second equation represents the scenario in which novelty of the object initiates a 
“pick” signal to the arm when a decisive behavioral categorization is not possible. 
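
The behavior of the ambiguity neuron can be made concrete through its steady state. Assuming that both
ambiguity equations have the shunting form da/dt = -A a + (B - a) E - a H, where E is the summed
excitatory drive (the feature activities in the first case, the decision outputs of the object
representation network in the second) and H is the summed output of the behavioral decision layer, the
steady state is a = B E / (A + E + H). The Python sketch below evaluates this expression. The function
name ambiguity_steady_state and all numerical values are assumptions chosen for illustration.

```python
import numpy as np


def ambiguity_steady_state(excitation, inhibition, A=0.1, B=1.0):
    """Steady state of da/dt = -A a + (B - a) E - a H for summed inputs E and H."""
    E = float(np.sum(excitation))
    H = float(np.sum(inhibition))
    return B * E / (A + E + H)


if __name__ == "__main__":
    features = [1.0, 0.0]                 # an object is in focus
    no_decision = [0.0, 0.0]              # behavioral decision layer silent
    clear_decision = [0.94, 0.02]         # one behavioral category has won
    print("object, no decision :", ambiguity_steady_state(features, no_decision))
    print("object, decision    :", ambiguity_steady_state(features, clear_decision))
    print("no object           :", ambiguity_steady_state([0.0, 0.0], no_decision))
```

Under these assumptions the ambiguity neuron is most active when an excitatory drive is present and the
behavioral decision layer is silent, is reduced once a behavioral category has been selected, and stays
at zero when there is no drive at all.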


5.3 Simulation Results

A simulation showing the spatial scanning performed by the robot on two objects in its field of view
is shown in Figure 26. The curves in the figure graph the activity of the visual neuron (which
represents the robot's focus of attention). The continuous and the dashed curves represent the
activities due to the two respective objects. Similarly, the two different kinds of spikes illustrate
whether the object in focus has been categorized as a targeted object (represented by a continuous
spike) or as a junk object (represented by a dashed spike).

[Figure 26 plot; horizontal axis: time, with ticks from 20 to 100]

Figure 26: Simulation of the scanning and categorization of two objects. The feature vectors for these
two objects are (1 0) and (0 1) respectively. The two kinds of curves, continuous and dashed, represent
the object that is currently in focus. The two types of spikes represent the categorization of the
object into the target object category (the dashed line) and the junk category (sparse dashed lines).
Initially, when the first object is introduced into the environment, the robot focuses on the object
and categorizes it as a targeted object. A positive reinforcement is given to indicate that the
decision of the robot was indeed correct. Subsequently the robot focuses on the second object and deems
it a targeted object. A negative reinforcement signal is applied to indicate that the object is a junk
object. The robot then focuses back and forth between the two objects, categorizing them correctly into
target and junk objects. Note that three reinforcement signals (yes, no, yes) are required for the
robot to categorize correctly according to the first feature. Moreover, reinforcement signals are
applied only when an object is considered a targeted object and the robot picks it up.

The scenario consists of two objects introduced one after another in the field of view of the robot.
The two objects have complementary features, i.e. one of the objects is a grey wrench and
the other is a pink spring. The feature vectors describing the two objects are (1001) and (0110)
respectively¹¹. Initially a grey wrench is introduced; the robot zeros in on it and categorizes it as a
good object. The robot is rewarded for its action. Subsequently the spring appears in the robot's field
of view, causing it to focus on this new object. Since there is insufficient evidence regarding the
behavioral quality of the object, the robot picks this new object up. The robot is then punished for
its decision¹². The scanning mechanism then focuses on the first object and categorizes it as a
targeted object. It then scans back to the pink spring but does not pick it up, because it is now
categorized correctly as a junk object. Thus after three reinforcements the robot has learnt to
categorize an object as a target object or a junk object depending on whether or not it is a tool.


6 Conclusion

We described in this report a neural network based robotic system. Unlike traditional robotic systems,
our approach focused on non-stationary problems. We indicated that a self-organization capability is
necessary for any system to operate successfully in a non-stationary environment. We suggested that
self-organization should be based on an active exploration process. We investigated neural
architectures having novelty sensitivity, selective attention, reinforcement learning, habit formation
and flexible criteria categorization properties, and analyzed the resulting behavior (consisting of an
intelligent initiation of exploration) by computer simulations.

While various computer vision researchers have recently acknowledged the importance of active
processes (Swain and Stricker, 1991), the approaches proposed within the new framework still suffer
from a lack of self-organization (Aloimonos and Bandyopadhyay, 1987; Bajcsy, 1988).

A self-organizing, neural network based robot (MAVIN) has recently been proposed (Baloch and
Waxman, 1991). This robot has the capability of position-, size- and rotation-invariant pattern
categorization, recognition and Pavlovian conditioning. Our robot does not initially have invariant
processing properties. The reason for this is the emphasis we put on active exploration. We maintain
the point of view that such invariant properties emerge from an internalization of exploratory
sensory-motor activity. Rather than coding the equilibria of such mental capabilities, we are seeking
to capture their dynamics, to understand on the one hand how the emergence of such invariances is
possible and on the other hand the dynamics that lead to these invariances. The second point is crucial
for an adaptive robot to acquire new invariances in non-stationary environments, as demonstrated by the
inverting glass experiments of Helmholtz¹³.

11 Each vector represents the features of the input object, where $f_1$ indicates whether the object
is a tool (10) or not (01) and $f_2$ states whether it is pink (10) or grey (01) in color.

12 Note that the robot is rewarded or punished only when it picks up an object.

We will introduce Pavlovian conditioning circuits in our future work for the precise objective of
achieving the generation, coordination and internalization of sequences of actions. From this
perspective, our system differs from MAVIN in that MAVIN passively associates stimuli (as the passive
cat of Held’s experiment (Held, 1963, 1965)) while our system, in addition to passive association,
acquires active association (as the active cat in Held’s experiment (Held, 1963, 1965)).

In our future work we will implement this system in hardware. Such an implementation will lead
to long-term tests of the system. The natural theoretical extension of the work is the integration of
associative learning circuits for the generation of sequences of actions, their intelligent
coordination, and their mental internalization.

13 A similar phenomenon is experienced by many of us who try glasses for the first time. Initially we
are exposed to a distorted perception, which is subsequently corrected by the brain.